Abstract. We introduce streaming data string transducers that map
input data strings to output data strings in a single left-to-right pass
in linear time. Data strings are (unbounded) sequences of data values,
tagged with symbols from a finite set, over a potentially infinite data
domain that supports only the operations of equality and ordering. The
transducer uses a finite set of states, a finite set of variables ranging over
the data domain, and a finite set of variables ranging over data strings.
At every step, it can make decisions based on the next input symbol,
updating its state, remembering the input data value in its data
variables, and updating data-string variables by concatenating data-string
variables and new symbols formed from data variables, while avoiding
duplication. We establish that the problems of checking functional
equivalence of two streaming transducers, and of checking whether a streaming
transducer satisfies pre/post verification conditions specified by
streaming acceptors over input/output data-strings, are in Pspace.

We identify a class of imperative and a class of functional programs,
manipulating lists of data items, which can be effectively translated to
streaming data-string transducers. The imperative programs dynamically
modify a singly-linked heap by changing next-pointers of heap-nodes
and by adding new nodes. The main restriction specifies how the
next-pointers can be used for traversal. We also identify an expressively
equivalent fragment of functional programs that traverse a list using
syntactically restricted recursive calls. Our results lead to algorithms for
assertion checking and for checking functional equivalence of two
programs, written possibly in different programming styles, for commonly
used routines such as insert, delete, and reverse.
1 Introduction



We propose streaming transducers as an abstract and analyzable model for pro- 
grams that access and modify sequences of data items in a single pass. The idea 
of using transducers for modeling such programs is a natural one. However, the 
class of regular transductions, which has appealing theoretical properties such as 
MSO characterization, is defined by two-way transducers. As an example, consider
the reverse transduction that reverses the input string. It is not definable
using classical one-way transducers (such as Mealy machines): we need a second
(backward) pass during which the output is produced. On the other hand, this
transduction can easily be computed by a single-pass program that traverses
a list both in the settings of imperative programs manipulating heap-allocated 
lists and functional programs using tail recursion. Our streaming transducer 
model can capture such computations naturally. Furthermore, we show that ver- 
ification problems such as checking functional equivalence, assertion checking, 
and checking correctness with respect to pre/post conditions, are decidable for 
this transducer model, even in the presence of values from an unbounded data 
domain. 

In the proposed model, a (deterministic) streaming data-string transducer
maps input data strings to output data strings in a single left-to-right pass in
linear time. A data string is a sequence of items of type data × tag, where data
is a potentially infinite set of data values, and tag is a finite set of labels. The 
only operations allowed on the type data are tests for equality and ordering, 
and this restriction is essential for decidability. The transducer uses a finite set 
of states, a finite set of variables ranging over data, and a finite set of variables 
ranging over data strings. At every step, it can make decisions based on the 
current state, the tag of the next input symbol, and the ordering relationship of 
the data of the next input symbol with the data values stored in data variables. 
It can update the state, modify data variables using the input data value, and 
update the data-string variables using assignments whose right-hand-sides are 
concatenations of data-string variables and new symbols formed using data vari- 
ables. A key restriction is that a data-string variable can be used at most once 
in a right-hand-side expression at each step. Multiple data-string variables are 
necessary for the transducer to compute different possible chunks of the output, 
and the restriction on how they can be used ensures that at every step there is 
merely a rearrangement of outputs computed so far without duplication. 

We consider the following two decision problems for streaming transducers: 
(1) Equivalence: given two streaming transducers, do they define the same (par- 
tial) functions? (2) Pre-post condition checking: given a streaming transducer, a 
pre-condition and a post-condition, both expressed by similarly defined stream- 
ing acceptors over data strings, is it the case that for every input satisfying the 
pre-condition, the corresponding transducer output satisfies the post-condition? 
We show both problems to be in Pspace. We also show that extending the 
model along several possible directions leads quickly to undecidability of basic 
problems such as reachability. 

We then identify a class of programs that precisely correspond to the stream- 
ing data-string transducers. The input to a program is a single list with elements 
of type data × tag, possibly with additional arguments of types data, tag, and
bool, and the output is a single list, possibly with additional returned values 
of types data, tag, and bool. The key restriction, needed for decidability of 
verification problems, is that the program computes the output in a single pass 
processing the next item of the list at each step. A number of commonly used 
routines such as insertion, deletion, membership, reversal, sorting with respect 
to tags (but not data values), naturally satisfy this restriction.



For heap-manipulating imperative programs, the input list is stored in a
heap of nodes each of which can store a tag, a data value, and a next-node
pointer. During the computation of the program, the next-pointers induce an
unranked forest structure over the nodes. The program accesses the heap using
a finite number of pointer variables and uses a finite number of data variables. 
The program can add new nodes to the heap, change values stored at nodes 
referenced by pointer variables, and also modify next-pointers of such nodes. A 
key restriction on the traversal of the heap using next-pointers is that the only 
legal use of the next-field on the right-hand-side is in the assignment curr := 
curr.next, where curr is the unique pointer initially pointing to the head of the 
input list. We also show that for this class of programs, a variety of assertion 
checking problems (such as "is a program location reachable," and "does the 
heap stay acyclic") are solvable in Pspace. 

Finally, we present a class of list-processing functional programs which
traverse an input list from left to right using recursive calls. The key
restriction is that a call to the function f with input list l can recursively
call f with input tail(l), and returns a value obtained by composing its input
arguments and the values returned by the recursive call without examining them.
We show that
this class precisely corresponds to the streaming data string transducers. Thus, 
the results of this paper show how to automatically compile two list-processing 
programs, one written in imperative style, and one written in functional style, 
into an intermediate low-level transducer model, and algorithmically check if the 
two are semantically equivalent. 

2 Streaming transducers 

2.1 Data strings and transductions 

A data domain is a totally ordered, possibly infinite, set $D$ of data values. We
will use $<$ to denote the strict total order over $D$. Throughout this paper, assume
$D$ to be fixed. A data symbol over $\Sigma$, where $\Sigma$ is a finite set of symbols or tags, is
a pair $(\sigma, d)$ with $\sigma \in \Sigma$ and $d \in D$. A data string $w$ over $\Sigma$ is a finite sequence
$(\sigma_1, d_1), (\sigma_2, d_2), \ldots, (\sigma_k, d_k)$ of data symbols over $\Sigma$. A data language over $\Sigma$
is a set $L$ of data strings over $\Sigma$. A (deterministic) data transduction from an
input alphabet $\Sigma$ to an output alphabet $\Gamma$ is a partial function $F$ from data
strings over $\Sigma$ to data strings over $\Gamma$. For a data transduction $F$ from an input
alphabet $\Sigma$ to an output alphabet $\Gamma$, the domain of $F$ is the data language over
$\Sigma$ consisting of data strings $w$ such that $F(w)$ is defined.

As an example, let $D$ be the set of strings over ASCII characters ordered by
the standard lexicographic ordering. Let $\Sigma$ contain two tags private and public.
A data symbol denotes an entry in an address-book consisting of a name tagged
either as private or public. A data string represents an address-book. Here are
a few examples of data languages: language $L_1$ consists of all data strings in
which names appear in the alphabetic order (that is, data symbols are sorted in
an increasing order according to $<$ over data values); and language $L_2$ consists
of all data strings that do not contain duplicate entries. A few examples of
useful data transductions are: transduction $F_1$ maps a data string to its reverse;
transduction $F_2$ maps a data string $w$ to $w_1.w_2$, where $w_1$ is the subsequence of
$w$ containing private entries, and $w_2$ is the subsequence of $w$ containing public
entries; and transduction $F_3$ deletes an entry (private, $d$) from the input data
string if the string also contains (public, $d$). To model operations such as insertion
and deletion that take data values/symbols as inputs in addition to a data string,
we can encode all inputs in a single data string. For example, the transduction
$F_4$, given an input data string $(\sigma, d).w$, checks if names appear in the alphabetic
order in the tail $w$, and if so, it returns $w$ with $(\sigma, d)$ inserted in the correct
position to maintain the output string sorted ($F_4$ is undefined if the input string
is empty, or if the tail of the input string is not sorted).
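
The four example transductions can be written down as reference specifications. The following Python sketch is illustrative only: it represents a data string as a list of (tag, data) pairs, uses Python string comparison as a stand-in for the lexicographic order, and the function names are hypothetical. Two points are assumptions rather than statements from the text: sortedness is taken to be strict, and in the sketch of $F_4$ the head symbol is placed after any tail symbol with an equal data value. These functions specify the input/output behavior; they are not single-pass transducers.

```python
from typing import List, Optional, Tuple

DataSymbol = Tuple[str, str]      # (tag, data value)
DataString = List[DataSymbol]


def f1_reverse(w: DataString) -> DataString:
    """F_1: map a data string to its reverse."""
    return list(reversed(w))


def f2_partition(w: DataString) -> DataString:
    """F_2: private entries first, then public entries, order preserved."""
    private = [s for s in w if s[0] == "private"]
    public = [s for s in w if s[0] == "public"]
    return private + public


def f3_drop_shadowed(w: DataString) -> DataString:
    """F_3: delete (private, d) whenever (public, d) occurs anywhere in w."""
    public_names = {d for (tag, d) in w if tag == "public"}
    return [(tag, d) for (tag, d) in w
            if not (tag == "private" and d in public_names)]


def f4_insert_head(w: DataString) -> Optional[DataString]:
    """F_4: insert the head symbol into the sorted tail; None means undefined."""
    if not w:
        return None
    head, tail = w[0], w[1:]
    # Assumption: "sorted" means strictly increasing with respect to <.
    if any(not a[1] < b[1] for a, b in zip(tail, tail[1:])):
        return None
    # Assumption: the head goes after tail symbols with an equal data value.
    pos = sum(1 for (_, d) in tail if d <= head[1])
    return tail[:pos] + [head] + tail[pos:]
```

Note that `f3_drop_shadowed` looks at the whole input before producing any output, which is exactly the kind of behavior a single-pass model with finitely many data variables has trouble with.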

2.2 Transducer definition 

We now describe our model of deterministic transducers. The transducer reads 
an input data string left-to-right in a single pass, and computes an output data
string. The transducer uses a finite set of states, a finite set of data variables 
that range over data values, and a finite set of data string variables that range 
over data strings over the output alphabet. At each step, the transducer reads 
the next data symbol of the input string, and chooses a transition depending 
on the current state, the tag of the input symbol, and the ordering relationship 
of the data value of the input symbol with values of all its data variables. The 
transition updates the state, updates the data variables possibly using the input 
data value, and updates the data string variables in parallel using assignments 
whose right-hand-sides are concatenations of data string variables and new data 
symbols formed using data variables. When the transducer consumes the entire 
input string, the final output string is produced by similarly concatenating data 
string and data variables. A key restriction is that a data string variable can be 
used at most once in a right-hand-side expression in a transition, and thus, at 
every step, there is merely a rearrangement of output chunks computed so far, 
without duplication. 

We now define the model formally. A (deterministic) streaming data-string
transducer (SDST) $S$ from an input alphabet $\Sigma$ to an output alphabet $\Gamma$ consists
of a finite set of states $Q$, an initial state $q_0 \in Q$, a finite set of data variables
$V$, a data variable $curr \in V$ used to refer to the data value of the current input
symbol, a finite set of data string variables $X$, a partial output function $O$ from
$Q$ to $((\Gamma \times V) \cup X)^*$, a finite set $E$ of transitions of the form $(q, \sigma, \varphi, q', \alpha)$,
where $q \in Q$ is a source state, $\sigma \in \Sigma$ is an input tag, $\varphi$ is a Boolean formula
over atomic constraints of the form $v < curr$ and $curr < v$ with $v \in V$, $q' \in Q$
is a target state, and $\alpha$ is an assignment mapping data variables $V$ to $V$ and
data string variables $X$ to $((\Gamma \times V) \cup X)^*$. It is required that (1) for each $q \in Q$
and $x \in X$, there is at most one occurrence of $x$ in $O(q)$, (2) for each
transition $(q, \sigma, \varphi, q', \alpha)$ and each $x \in X$, $x$ appears at most once in the set of
strings $\{\alpha(y) \mid y \in X\}$, and (3) for each pair of transitions $(q, \sigma, \varphi, q', \alpha)$ and
$(q, \sigma, \varphi', q'', \alpha')$ with the same source state and input tag, the tests $\varphi$ and $\varphi'$ are
mutually exclusive (that is, $\varphi \wedge \varphi'$ is unsatisfiable).



A valuation $\beta$ for such a transducer $S$ is a partial function over data and data
string variables such that for each data variable $v \in V$, either $\beta(v)$ is undefined
or is a data value in $D$, and for each data string variable $x \in X$, either $\beta(x)$
is undefined or is a data string over the output alphabet $\Gamma$. Such a valuation
naturally extends to a partial Boolean function to evaluate tests. Each test $\varphi$ is
a Boolean formula over atomic constraints of the form $v < curr$ and $curr < v$
with $v \in V$. The value $\beta(\varphi)$ is defined if $\beta(v)$ is defined for all data variables $v$
occurring in $\varphi$, and if so, $\varphi$ is evaluated using the values $\beta$ assigns to these data
variables. A valuation $\beta$ also extends to strings in $((\Gamma \times V) \cup X)^*$: given a string
$u$ in $((\Gamma \times V) \cup X)^*$, the valuation $\beta(u)$ is defined when $\beta$ is defined for all the
data and data string variables occurring in $u$, and if so, $\beta(u)$ is the data string
over the output alphabet $\Gamma$ obtained by replacing each data string variable $x$ in
$u$ with the data string $\beta(x)$ and each data variable $v$ in $u$ with the data value
$\beta(v)$.

Given an SDST $S$, a configuration of $S$ is a pair $(q, \beta)$, where $q$ is a state in
$Q$, and $\beta$ is a valuation for $S$. The initial configuration is $(q_0, \beta_0)$ where $q_0$ is
the initial state of $S$, $\beta_0(v)$ is undefined for each data variable $v$, and $\beta_0(x)$ is
the empty string for each data string variable $x$. The one-step transition relation
over the set of configurations is defined as follows. Consider a configuration $(q, \beta)$
and an input data symbol $(\sigma, d)$. The transducer first updates the valuation $\beta$
to $\beta'$ by setting $curr$ to the input data value $d$. Next, let $(q, \sigma, \varphi, q', \alpha)$ be a
transition such that $\beta'$ satisfies the test $\varphi$. If there is such a transition, then the
updated state is $q'$, and the updated value of each data and data string variable
$x$ is obtained by evaluating the right-hand-side $\alpha(x)$ according to the valuation
$\beta'$. That is, if there exists a transition $(q, \sigma, \varphi, q', \alpha)$ such that $\beta'(\varphi) = 1$, for
$\beta' = \beta[curr \mapsto d]$, then $(q, \beta) \xrightarrow{(\sigma, d)} (q', \beta' \cdot \alpha)$. Determinism ensures that each
configuration has at most one successor for a given input data symbol. The
transition relation extends to a multi-step relation over input data strings in
the natural way. Given an input data string $w$ over $\Sigma$, $(q_0, \beta_0) \xrightarrow{w} (q, \beta)$ means
that the configuration of the transducer after reading the input data string $w$ is
$(q, \beta)$ (if no such configuration exists that means no transition is enabled at some
step). The semantics of $S$ is then defined to be the transduction $[\![S]\!]$ defined as:
for an input data string $w$ over $\Sigma$, if $(q_0, \beta_0) \xrightarrow{w} (q, \beta)$ and $O(q)$ is defined then
$[\![S]\!](w)$ is defined to be $\beta(O(q))$, otherwise $[\![S]\!](w)$ is undefined.

We call a data transduction $F$ from an input alphabet $\Sigma$ to an output
alphabet $\Gamma$ streaming-regular if there exists an SDST $S$ such that $[\![S]\!] = F$.
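
The definition can be exercised with a small interpreter. The sketch below is one possible encoding, not the paper's implementation: the class and function names (`Transition`, `SDST`, `run`, `is_copyless`) are hypothetical, tests are passed as Python predicates over the data valuation instead of Boolean formulas, and mutual exclusion of tests is assumed rather than checked. A right-hand side is a list whose items are either a string-variable name or a pair (tag, data-variable name).

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple, Union

Symbol = Tuple[str, object]                 # concrete data symbol (tag, data value)
Item = Union[Tuple[str, str], str]          # (tag, data variable) or string variable
Env = Dict[str, object]                     # valuation of data variables


@dataclass
class Transition:
    src: str
    tag: str
    test: Callable[[Env], bool]             # guard over data variables, incl. "curr"
    dst: str
    data_update: Dict[str, str]             # data variable -> data variable
    string_update: Dict[str, List[Item]]    # string variable -> right-hand side


@dataclass
class SDST:
    initial: str
    string_vars: List[str]
    transitions: List[Transition]
    output: Dict[str, List[Item]]           # partial output function O


def is_copyless(rhs_list: List[List[Item]], string_vars: List[str]) -> bool:
    """Each string variable occurs at most once across all right-hand sides."""
    used = [it for rhs in rhs_list for it in rhs if isinstance(it, str)]
    return all(used.count(x) <= 1 for x in string_vars)


def evaluate(rhs: List[Item], data: Env, strings: Dict[str, List[Symbol]]) -> List[Symbol]:
    out: List[Symbol] = []
    for it in rhs:
        if isinstance(it, str):
            out.extend(strings[it])          # splice in a string variable
        else:
            out.append((it[0], data[it[1]]))  # new symbol from a data variable
    return out


def run(m: SDST, w: List[Symbol]) -> Optional[List[Symbol]]:
    q, data = m.initial, {}
    strings: Dict[str, List[Symbol]] = {x: [] for x in m.string_vars}
    for tag, d in w:
        env = dict(data, curr=d)             # beta' = beta[curr -> d]
        enabled = [t for t in m.transitions
                   if t.src == q and t.tag == tag and t.test(env)]
        if not enabled:
            return None                      # no transition enabled: undefined
        t = enabled[0]                       # assumed unique (determinism)
        # Variables without an explicit assignment keep their value.
        full = {x: t.string_update.get(x, [x]) for x in m.string_vars}
        assert is_copyless(list(full.values()), m.string_vars)
        new_data = dict(env)
        new_data.update({v: env[src] for v, src in t.data_update.items()})
        strings = {x: evaluate(rhs, env, strings) for x, rhs in full.items()}
        q, data = t.dst, new_data
    out = m.output.get(q)
    return None if out is None else evaluate(out, data, strings)


def reverse_sdst(tags: List[str]) -> SDST:
    """One state, one string variable x, update x := (tag, curr).x."""
    loops = [Transition("q", t, lambda env: True, "q", {}, {"x": [(t, "curr"), "x"]})
             for t in tags]
    return SDST("q", ["x"], loops, {"q": ["x"]})
```

For instance, `run(reverse_sdst(["a", "b"]), [("a", 1), ("b", 2)])` prepends each symbol to `x` and outputs `x` at the end. The Python lists used here copy on concatenation, so this sketch illustrates the semantics only, not the linear running time.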

2.3 Examples 

To illustrate our definition of transducers, let us consider the transductions
mentioned in Section 2.1. The transduction $F_1$ to reverse the input data string can
be implemented by a streaming data-string transducer $S_1$ with a single state, a
single data variable $curr$, and a single data string variable $x$. The input tag $\sigma$ is
processed by the self-loop transition with update $x := (\sigma, curr).x$ (by default, a
variable that is not explicitly updated remains unchanged; we omit such
assignments for readability). The output function outputs $x$. No tests on input data
values are needed. Notice that classical definitions of transducers allow adding
output symbols only at the end of the output computed so far. Adding a symbol
to the front of the string variable $x$ at each step is crucial to implement reverse
in a single left-to-right pass.

Fig. 1. Transduction $F_4$

Now let us consider the transduction $F_2$ that maps a data string $w$ to $w_1.w_2$,
where $w_1$ and $w_2$ are the subsequences of $w$ containing private and public entries,
respectively. This can be implemented by an SDST $S_2$ that maintains two data
string variables $x_1$ and $x_2$, and a single data variable $curr$. At each step, if the
tag of current input symbol is private, the symbol is added to $x_1$ (the precise
assignment is $[x_1, x_2] := [x_1.(private, curr), x_2]$), otherwise the symbol is added
to $x_2$ in a symmetric manner. The output function outputs the concatenation
$x_1.x_2$. Note that it is not possible to implement this transduction by an SDST
using just one data string variable.

The transduction $F_3$ deletes a private entry from the input data string if the
string also contains a public entry with a matching data value. When reading an
input symbol with data value $d$, the streaming algorithm needs to figure out if a
private entry with the same data value has been encountered before. An SDST
with $k$ variables can effectively use only $k$ data values for tests at any step, and
since the number of possible data values in an input string is unbounded, $F_3$ is
not streaming-regular.

Consider the transduction $F_4$ that inserts the head symbol of the input string
in its tail, provided that the tail is sorted. The transducer $S_4$ uses three data
variables: $u$ to remember the head data value, $v$ to remember the previous data
value, and $curr$ to refer to the current data value. It uses a data string variable $x$
to remember the first data symbol, and $y$ to compute the output. The transducer
is shown in Figure 1. The transducer is in state $q_0$ initially, in state $q_1$ after
reading one symbol, in state $q_2$ after reading 2 or more symbols provided the
tail so far is sorted and all its data values are smaller than the head data value
(stored in $u$), and in state $q_3$ after reading 2 or more symbols provided the tail
so far is sorted and the head symbol has already been inserted in the output.
In states $q_2$ and $q_3$, the variable $v$ stores the previous data value, and the test
$v < curr$ checks for sortedness of the input ($v$ gets updated to $curr$ at each
step). If this test does not hold, no transition is enabled, and the output will be
undefined. The transition to $q_3$ inserts the data symbol stored in $x$ in the output
$y$. The output function is undefined in state $q_0$, and is $x$ in state $q_1$, $y.x$ in state
$q_2$, and $y$ in state $q_3$.
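
The behavior of $S_4$ described above can be simulated directly. The following Python sketch follows the prose description of the states $q_0$ to $q_3$ and the variables $u$, $v$, $x$, $y$. It is a reconstruction under stated assumptions, not a transcription of Figure 1: a tail value equal to the head value is treated like a smaller one (the head is inserted after it), and $x$ is explicitly emptied when its content is moved into $y$ so that no string variable is used twice. The function name is hypothetical, and `None` stands for an undefined output.

```python
from typing import List, Optional, Tuple

Symbol = Tuple[str, str]    # (tag, data value)


def insert_head_streaming(w: List[Symbol]) -> Optional[List[Symbol]]:
    state = "q0"
    u = v = None                  # head data value, previous data value
    x: List[Symbol] = []          # holds the head symbol until it is inserted
    y: List[Symbol] = []          # output built so far
    for tag, curr in w:
        if state == "q0":
            # Remember the head symbol and its data value.
            u, x, state = curr, [(tag, curr)], "q1"
        elif state == "q1":
            if u < curr:
                # Head is smaller than the first tail symbol: insert it now.
                y, x, state = x + [(tag, curr)], [], "q3"
            else:
                y, state = [(tag, curr)], "q2"
            v = curr
        else:
            if not v < curr:
                return None       # tail not sorted: no transition is enabled
            if state == "q2" and u < curr:
                # First tail value above the head: insert the head before it.
                y, x, state = y + x + [(tag, curr)], [], "q3"
            else:
                y = y + [(tag, curr)]
            v = curr
    # Output function: undefined in q0, x in q1, y.x in q2, y in q3.
    if state == "q0":
        return None
    if state == "q1":
        return x
    if state == "q2":
        return y + x
    return y
```

Each update mentions `x` and `y` at most once on the right-hand side, which mirrors the copyless restriction; the list concatenations themselves copy data in Python, so the constant-time update is not reflected here.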

2.4 Streaming acceptors 

A streaming data-string acceptor (SDSA) is a streaming data-string transducer
$S$ with an empty set of data string variables. Such an acceptor $S$ has a finite
set of data variables that can remember the data values from the input string,
and can make decisions based on their relative ordering. The output alphabet
plays no role in the behavior of an acceptor. Given an input data string $w$, either
the output $[\![S]\!](w)$ is defined or undefined, and the domain of the transducer is
the data language associated with the acceptor $S$. This is the same as saying
that the output function $O$ marks states of $S$ as accepting or rejecting based
on whether $O$ is defined or undefined at a state. We call a
data language $L$ over an alphabet $\Sigma$ streaming-regular if $L$ is accepted by
a streaming data-string acceptor.

The data language $L_1$ consisting of sorted data strings can be defined by
such an acceptor using one data variable that remembers the previous data
value, along with the data variable $curr$ needed to refer to the data value of the
currently read symbol. Thus $L_1$ is a streaming-regular data language. The data
language $L_2$ consisting of data strings without duplicate entries is not
streaming-regular by an argument similar to the one for the transduction $F_3$.
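
The acceptor for $L_1$ amounts to a loop with a single remembered value. A minimal Python sketch, assuming that sortedness is strict (equal neighbors are rejected) and that the empty data string is accepted; the function name is hypothetical:

```python
from typing import List, Tuple

Symbol = Tuple[str, str]    # (tag, data value)


def accepts_sorted(w: List[Symbol]) -> bool:
    """Accept iff data values are strictly increasing; uses one data variable."""
    prev = None                       # the single remembered data value
    for _tag, curr in w:
        if prev is not None and not prev < curr:
            return False              # no transition enabled: reject
        prev = curr                   # prev := curr
    return True
```

Only the ordering test `prev < curr` is applied to data values, matching the restriction that the data domain supports equality and ordering tests only.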

Among different types of automata over data strings that have been studied, 
data automata [3] have emerged as a good candidate for the notion of regularity 
for languages of data strings. However, data automata are too expressive for 
our purpose as they have an undecidable emptiness problem in the presence of 
ordering on data values [3] . 

2.5 Properties 

In this section, we note some properties of streaming transducers aimed at
understanding their expressiveness. First, observe that a streaming data-string
transducer $S$ cannot output "new" data values. That is, for every input data string
$w$, any data value appearing in the output data string $[\![S]\!](w)$ must appear in
some symbol in $w$. Second, streaming transducers are bounded in the sense that
the length of the output string is within at most a constant factor of the length
of the input string.

Proposition 1. If $F$ is a streaming-regular transduction from $\Sigma$ to $\Gamma$, then for
all input data strings $w$ over $\Sigma$, $|F(w)| = O(|w|)$.
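
One way to see the bound is the following short calculation. The symbols $c$ and $\ell_i$ are notation introduced here for illustration: $c$ is the largest number of new symbols (elements of $\Gamma \times V$) occurring in the right-hand sides of any single transition or in any value of the output function $O$, and $\ell_i$ is the total length of the contents of all data string variables after $i$ input symbols, with $\beta_i$ the valuation at that point.

```latex
\ell_i = \sum_{x \in X} |\beta_i(x)|, \qquad \ell_0 = 0, \qquad
\ell_{i+1} \le \ell_i + c
\;\Longrightarrow\; \ell_{|w|} \le c\,|w|,
\qquad
|F(w)| \le \ell_{|w|} + c \le c\,(|w| + 1).
```

The step $\ell_{i+1} \le \ell_i + c$ uses requirement (2) of the definition: each data string variable occurs at most once across all right-hand sides, so existing contents are rearranged but never duplicated. The final inequality uses requirement (1) in the same way for $O(q)$.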

The boundedness depends on the fact that the parallel assignment at each
step is copyless: each variable can appear in a right-hand-side expression at most
once. Not only is this crucially needed for decidability of the equivalence problem,
it also allows an efficient implementation: if the data strings corresponding to
variables are stored in linked lists, then the assignment can be executed by only
changing a constant number of pointers (proportional to the description of the
transducer, but independent of the lengths of the data strings they store, and
thus, independent of the length of the input string).

Proposition 2. If $F$ is a streaming-regular data-string transduction, then given
an input string $w$, the output $F(w)$ can be computed in time $O(|w|)$.
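
The pointer-based implementation can be sketched as follows. This is an illustration under assumptions, not the paper's data structure: each data string variable is a singly linked list with head and tail pointers (the classes `Node` and `Chunk` are hypothetical names), and concatenation relinks one pointer and empties the absorbed operand, which is safe precisely because a copyless assignment uses each variable at most once.

```python
class Node:
    def __init__(self, sym):
        self.sym = sym
        self.next = None


class Chunk:
    """A data string stored as a linked list with head and tail pointers."""

    def __init__(self):
        self.head = None
        self.tail = None

    @classmethod
    def single(cls, sym):
        c = cls()
        c.head = c.tail = Node(sym)
        return c

    def absorb(self, other):
        """Append `other` in O(1) by relinking; `other` becomes empty."""
        if other.head is not None:
            if self.head is None:
                self.head, self.tail = other.head, other.tail
            else:
                self.tail.next = other.head
                self.tail = other.tail
            other.head = other.tail = None
        return self

    def to_list(self):
        out, n = [], self.head
        while n is not None:
            out.append(n.sym)
            n = n.next
        return out


def reverse_streaming(w):
    """Reverse via the update x := (tag, curr).x, one O(1) step per symbol."""
    x = Chunk()
    for sym in w:
        x = Chunk.single(sym).absorb(x)
    return x.to_list()
```

Each loop iteration allocates one node and changes a constant number of pointers, so the whole pass is linear in the length of the input; only the final `to_list` traversal touches the accumulated output.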

This also means that the sorting transduction that maps an input data string
to its sorted version is not streaming-regular due to well-known lower bounds for
sorting. Clearly, streaming transducers cannot capture all linear-time streaming
algorithms. As a specific example, let us revisit the transduction $F_2$ that
maps a data string $w$ to $w_1.w_2$, where $w_1$ and $w_2$ are the projections of $w$
containing private and public entries, respectively. Consider the variation $F_2'$ that
maps $w$ to a merge of the two projections $w_1$ and $w_2$ taking elements from the
two lists in an alternate manner. This can be easily implemented in linear time if
we maintain two read-heads over the input string, one corresponding to private
entries and one corresponding to public entries. Note that the emptiness problem
of finite automata with multiple read-heads is undecidable, and the traversal
allowed for streaming transducers is restricted by design to ensure decidability
of key analysis problems. In particular, $F_2'$ is not streaming-regular.

Let us now consider some closure properties for the class of streaming-regular
transductions. Given two data transductions $F_1$ and $F_2$, and a test $L$ as a data
language, suppose we want to compute $F_1(w)$ when $w \in L$ and $F_2(w)$ otherwise.
If all of $F_1$, $F_2$, and $L$ are specified using SDSTs then we can construct an SDST
for the desired transduction by a suitably modified product construction.

Proposition 3. If $F_1$ and $F_2$ are streaming-regular data transductions from $\Sigma$
to $\Gamma$, and $L$ is a streaming-regular data language over $\Sigma$, then the following data
transduction $F$ is streaming-regular: for an input data string $w$ over $\Sigma$, if $w \in L$
then $F(w) = F_1(w)$ else $F(w) = F_2(w)$.

It turns out that streaming-regular data transductions are not closed under
functional composition. That is, given two SDSTs $S_1$ and $S_2$, we cannot always
construct an SDST $S$ such that $S(w) = S_2(S_1(w))$.

Proposition 4. There exist streaming-regular data transductions $F_1$ and $F_2$
from $\Sigma$ to $\Sigma$ such that the following data transduction $F$ is not streaming-regular:
for an input data string $w$ over $\Sigma$, $F(w) = F_2(F_1(w))$.

Proof. We choose $\Sigma$ to be a singleton set, and thus, it plays no role. Consider the
transduction $F_1$ that maps a data string $d_1 d_2 \cdots d_k$ to its reverse $d_k d_{k-1} \cdots d_1$.
$F_1$ is streaming-regular. Consider the transduction $F_2$ that maps a data string
$d_1 d_2 \cdots d_k$ to $(d_1)^k$. That is, $F_2$ just repeats the first data value for each input
symbol read. It is easy to implement $F_2$ by an SDST. Now consider the composition
$F = F_1 \cdot F_2$. The transduction $F$ maps an input data string $d_1 d_2 \cdots d_k$ to
$(d_k)^k$. We can prove that $F$ is not streaming-regular.
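
A concrete instance of the two transductions from the proof, sketched in Python with data strings represented as plain lists of data values (the tag is dropped because the tag alphabet is a singleton). The function names are illustrative, and the composition is computed here by two separate passes; the sketch shows what $F$ computes, not how an SDST would compute it.

```python
from typing import List


def f1(w: List[int]) -> List[int]:
    """F_1: reverse the data string."""
    return list(reversed(w))


def f2(w: List[int]) -> List[int]:
    """F_2: repeat the first data value once per input symbol."""
    return [w[0]] * len(w) if w else []


def f(w: List[int]) -> List[int]:
    """F: apply F_1, then F_2; maps d_1 ... d_k to k copies of d_k."""
    return f2(f1(w))
```

Intuitively, a single pass learns $d_k$ only at the last step, where a transition and the output function can contribute only a bounded number of new symbols, whereas the output needs $k$ copies of $d_k$.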



Note that the above proof crucially uses the fact that the data domain D is 
unbounded, and we can always find a "fresh" data value that has not appeared 
in the input string before. If we make D finite, then the streaming-regular trans- 
ducers are closed under composition. 

3 Imperative Programs Updating Linked Lists 

We consider a class of imperative programs that manipulate heap-allocated 
singly-linked list data structures. Each node of the heap stores a tag, a data 
value, and a pointer to another node. For clarity, in this section, we will as- 
sume that the output alphabet is the same as the input alphabet, so we need to 
consider tags of only one type. The input data string is stored in such a heap 
using one node for each position (null pointer indicates the end of the list). A 
list-processing program is invoked with the reference to the head-node of the 
list as input. The program traverses the list using next-pointers, and computes 
using variables that range over tags, over data values, over booleans, and over 
pointers into the heap. It can create new nodes and add them to the heap, and 
can also manipulate the shape of the heap by updating the next-pointers of the 
nodes referenced by its pointer variables. The output data string is returned 
using a pointer-variable that points to the head of the list storing that output. 
During the computation of the program, next-pointers of two heap-nodes may 
point to the same node, and thus, the heap in general has a structure of an 
unordered forest. Since the output is computed by possibly reusing the nodes 
that store the input, we need careful syntactic restrictions to allow a single-pass 
traversal of the input list, while disallowing repeated or nested traversals. In
particular, a typical traversal assignment x := y.next for pointer variables x
and y is disallowed. The only legal use of the next-field on the right-hand-side
is in the assignment curr := curr.next, where curr is the unique input pointer.
Assignments of the form x.next := y to update the heap structure are allowed,
provided x and curr are not referencing the same heap-node. An attempt to
execute x.next := y in a state where x and curr reference the same heap-node
causes a runtime error (alternatively we can require each such assignment to be
syntactically guarded by the boolean condition x ≠ curr).

A program can have additional input and output parameters, and each such 
input/output parameter can be a data value, a boolean value, or a tag. Before 
we describe the syntax and the semantics in detail, let us first consider a couple 
of examples. The following function reverses a list in-place, and corresponds to
the data transduction $F_1$:

function Reverse
  input ref curr; output ref result := curr;
  local ref prev := curr;
  if curr ≠ nil then {
    curr := curr.next;
    while curr ≠ nil {
      result := curr; curr := curr.next;
      result.next := prev; prev := result;
  }}.



Suppose that, given an input data string $w$ and an input data value $d$, we want to
delete each symbol in $w$ whose data value equals $d$, and return the resulting data
string along with a boolean flag that indicates whether or not some symbol was
actually deleted. The following function implements this:



function Delete
  input ref curr; input data v;
  output ref result; output bool b := 0;
  local ref prev;
  while (curr ≠ nil) & (curr.data = v) {
    curr := curr.next; b := 1;
  }
  result := curr; prev := curr;
  if curr ≠ nil then {
    curr := curr.next; prev.next := nil;
    while curr ≠ nil {
      if curr.data = v then {
        curr := curr.next; b := 1;
      } else {
        prev.next := curr; prev := curr;
        curr := curr.next; prev.next := nil;
  }; }; }.
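
For readers who want to run the routine, here is a direct Python transliteration of Delete. It is a sketch under assumptions: the `Node` class, the list helpers, and `set_next` are hypothetical names introduced here. The helper `set_next` models the rule that an assignment x.next := y is a runtime error when x and curr reference the same heap-node, and the only read of a next-field is `curr = curr.next`.

```python
class Node:
    def __init__(self, tag, data, next=None):
        self.tag, self.data, self.next = tag, data, next


def from_list(symbols):
    """Build a linked list from (tag, data) pairs; returns the head or None."""
    head = None
    for tag, data in reversed(symbols):
        head = Node(tag, data, head)
    return head


def to_list(head):
    out = []
    while head is not None:
        out.append((head.tag, head.data))
        head = head.next
    return out


def set_next(x, y, curr):
    """x.next := y, rejected when x and curr reference the same node."""
    if x is curr:
        raise RuntimeError("x.next := y attempted with x = curr")
    x.next = y


def delete(curr, v):
    b = False
    # Skip a leading run of symbols whose data value equals v.
    while curr is not None and curr.data == v:
        curr = curr.next
        b = True
    result = prev = curr
    if curr is not None:
        curr = curr.next
        set_next(prev, None, curr)
        while curr is not None:
            if curr.data == v:
                curr = curr.next
                b = True
            else:
                set_next(prev, curr, curr)   # legal: prev and curr differ here
                prev = curr
                curr = curr.next
                set_next(prev, None, curr)
    return result, b
```

Calling `delete(from_list([("a", 1), ("b", 2), ("a", 1)]), 1)` returns the head of the one-element list holding `("b", 2)` together with the flag `True`.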



3.1 Syntax 

Types: Variables are typed. The possible types are: bool for Boolean-valued
variables, tag for variables ranging over the alphabet $\Sigma$, data for variables ranging
over the data domain $D$ along with an "undefined" value denoted $\bot$, and ref for
reference variables that index into the data heap along with the null reference
nil.

Variable declarations: A program variable is declared along with its type (bool, 
tag, data, or ref) and an annotation which can be either local, input, or 
output. The input annotation means that the variable is an input to the func- 
tion. A function has exactly one input variable of type ref, and can have multiple 
input variables of other types. We will use curr to name this unique input ref- 
erence variable. The output annotation means that the variable is an output of 
the function, and local annotation means that the variable is neither an input 
nor an output. There is exactly one output variable of type ref which is used 
to return a single data string. The declaration of each output and local variable
has an associated initial value. The initial value of a variable of type bool or
tag can be either a constant or an input variable of matching type. The initial
value of a data variable can be either $\bot$ or an input data variable. The initial
value of a pointer variable can be either curr or nil.

Data expressions and assignments: A data expression is of the form (1) a variable 
of type data, or (2) r.data, where r is a variable of type ref, denoting the data
value stored in the heap-node indexed by r. A data assignment statement assigns 
a data expression to a data variable. 

Tag expressions and assignments: A tag expression is of the form (1) a variable 
of type tag, (2) a constant $\sigma$ from the alphabet $\Sigma$, or (3) r.tag, where r is a
variable of type ref, denoting the tag value stored in the heap-node indexed by 
r. A tag assignment statement assigns a tag expression to a tag variable. 

Reference expressions and assignments: A reference expression re is either a
variable of type ref or the constant nil. A reference assignment statement is
either (1) r := re, where r is a local or an output variable of type ref and re is a
reference expression, (2) r.next := re, where r is of type ref and re is a reference
expression, (3) r := new(te, de, re), where r is a local or an output variable of type
ref and te is a tag expression, de is a data expression, re is a reference expression,
or (4) curr := curr.next. The first assignment allows reassignment of reference
variables, except for the input variable curr. The second assignment updates the 
heap by changing the next-pointer of the heap-node indexed by r, provided r and 
curr do not point to the same heap-node. The third assignment creates a new 
heap-node with tag value given by te, data value given by de, and next-pointer 
given by re. The last assignment allows traversal, and is syntactically restricted 
to ensure that only the unique input reference variable is used to traverse the 
input list. 

Boolean expressions and assignments: An atomic boolean expression is either a 
boolean constant (0/1), or tests equality between two tag expressions, or tests 
equality or ordering between two data expressions, or tests equality between 
two reference expressions. A boolean expression is formed from atomic boolean 
expressions using standard logical connectives for negation, conjunction, and 
disjunction. A boolean assignment statement assigns a boolean expression to a 
boolean variable. 

Statements: An assignment statement is either a data assignment statement, 
a tag assignment statement, a reference assignment statement, or a boolean 
assignment statement. A statement s is either (1) an assignment statement, (2) 
a conditional statement of the form if be then s or if be then s1 else s2, where
be is a boolean expression, (3) a while statement of the form while be {s}, where
be is a boolean expression, or (4) a finite sequence of statements. 

Program: A single-pass list processing program P consists of a sequence of vari- 
able declarations followed by a statement. 



3.2 Semantics 

Recall that a program has a single input variable of type ref and a single output 
variable of type ref. The semantics of a program is defined as a partial function 
from an input data string together with values for input data/tag/boolean variables
to an output data string together with values for output data/tag/boolean
variables. For example, the semantics of Delete is a partial function from
$(\Sigma \times D)^* \times D$ to $(\Sigma \times D)^* \times \{0,1\}$.

Configurations: Given a program P, its configuration is completely described by 
(1) the values of its data, tag, boolean, and reference variables, (2) the program 
counter indicating the next statement to be executed, and (3) the data-heap. Let 
Loc be the set of locations in P (this can be the set of vertices in the control-flow 
graph of the program). A data-heap $h$ consists of a finite set $N$ of heap-nodes,
a data function $f_d : N \to D$ that gives the data element stored at each node,
a tag function $f_t : N \to \Sigma$ that gives the tag element stored at each node, and
a next-pointer function $f_n : N \to N_\bot$ that gives the next-pointer of each node,
where $N_\bot$ is the set $N$ together with the constant nil. A program-configuration
$c$ of $P$ then consists of a location $\ell \in \mathit{Loc}$, a data heap $h = (N, f_d, f_t, f_n)$, and a
partial function $\beta$ over all the program variables that maps each data variable
to $D$, each reference variable to $N$, each boolean variable to $\{0, 1\}$, and each tag
variable to $\Sigma$.

Initialization: Given an input data string $(\sigma_1,d_1) \cdots (\sigma_k,d_k)$, the initial heap
$h_0$ consists of the set $N = \{n_1, \ldots, n_k\}$ of nodes, one for each data symbol of
the input string. The data function is given by $f_d(n_i) = d_i$, the tag function is
given by $f_t(n_i) = \sigma_i$, and the next-pointer function is given by $f_n(n_i) = n_{i+1}$
for $i < k$ and $f_n(n_k) = \mathit{nil}$. The initial location $\ell_0$ is the unique entry location
of the control-flow graph. For the initial valuation $\beta_0$, $\beta_0(\mathit{curr}) = n_1$. For all
other input variables $x$, $\beta_0(x)$ is set to the corresponding input value. For all
local and output variables $x$, $\beta_0(x)$ is defined according to the initialization in
the declaration for $x$. The initial configuration $c_0$ of the program is $(\ell_0, h_0, \beta_0)$.

Transition relation over configurations: The operational semantics of programs 
is defined by a transition relation over the configurations. First, given a configuration
$c = (\ell, (N, f_d, f_t, f_n), \beta)$, there is a natural way to evaluate a data
expression de to obtain a data value $c(de) \in D$, a tag expression te to obtain a
tag value $c(te) \in \Sigma$, a reference expression re to obtain a value $c(re) \in N_\bot$, and
a boolean expression be to obtain a boolean value $c(be)$. Every program configuration
$c = (\ell, (N, f_d, f_t, f_n), \beta)$ can have at most one successor configuration,
determined by the statement s at location $\ell$. The details are standard, and we
illustrate them using a few cases. 

Suppose the statement is a conditional statement $\ell$ : if $b$ then $\ell_1 : s_1$ else $\ell_2 : s_2$.
Then, if $c(b) = 1$ then the successor configuration of $c$ is $(\ell_1, h, \beta)$, and if
$c(b) = 0$ then the successor configuration of $c$ is $(\ell_2, h, \beta)$.



Suppose the statement s is a reference assignment statement $\ell$ : r :=
new(te, de, re). Executing the statement s updates the control location
from $\ell$ to the unique successor location $\ell'$ of the statement s. For the
updated data heap $h'$, the set of nodes is $N \cup \{n\}$, where $n \notin N$ is a "new"
heap-node, the data function is $f_d[n \mapsto c(de)]$, the tag function is $f_t[n \mapsto c(te)]$,
and the next-pointer function is $f_n[n \mapsto c(re)]$. The updated valuation $\beta'$ is
$\beta[r \mapsto n]$.

Suppose the statement s is a reference assignment statement $\ell$ : r.next := re.
If $c(r) = c(\mathit{curr})$ then this is an error and the configuration $c$ has no successor.
Otherwise, the successor configuration is $c'$ such that the location $\ell'$ is the unique
successor location of $\ell$ in the control-flow graph, the valuation $\beta$ stays unchanged,
and the updated heap is $(N, f_d, f_t, f_n[c(r) \mapsto c(re)])$ (that is, the next-pointer of
the node $c(r)$ in the heap changes to $c(re)$, which may be nil or a heap-node).

Termination and output: An execution of the program is obtained by starting 
in the initial configuration $c_0$ and continuing with the successor configuration as
long as possible. If this execution is infinite, then the program is non-terminating,
and the output is undefined. Suppose the execution is finite and ends in the
configuration $c_f = (\ell_f, h_f, \beta_f)$. If the location $\ell_f$ is not the unique exit location
of the control-flow graph, then again the output is undefined. Otherwise, the
returned value of each output data/tag/boolean variable is given by the final
valuation $\beta_f$ of program variables. For the unique output reference variable r,
let $(\sigma_1,d_1)(\sigma_2,d_2) \cdots$ be the unique sequence of tag/data values stored in the
heap $h_f$ starting at the node $\beta_f(r)$ following the next-pointers until the nil
value is encountered. If this sequence is infinite, this indicates that the program 
created a cycle in the heap during its computation, and the output is again 
undefined. If this sequence is finite, it is the returned output data string. 
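
The heap operations and the output rule of this subsection can be illustrated with a small sketch. It is not part of the formal development: the dictionary representation of the heap, the use of Python's None for nil, and the names `initial_heap`, `new_node`, `set_next`, `read_output`, and `StuckError` are all assumptions made for illustration.

```python
class StuckError(Exception):
    """Raised when a configuration has no successor (an error state)."""


def initial_heap(word):
    """Build the initial heap: node i stores symbol i and points to node i+1."""
    k = len(word)
    heap = {}
    for i, (tag, data) in enumerate(word):
        heap[i] = (tag, data, i + 1 if i + 1 < k else None)  # None plays nil
    return heap


def new_node(heap, tag, data, nxt):
    """r := new(te, de, re): add a fresh node and return its identifier."""
    n = len(heap)  # fresh because nodes are numbered 0..len-1 and never removed
    heap[n] = (tag, data, nxt)
    return n


def set_next(heap, env, r, target):
    """r.next := re; an error if r and curr point to the same heap-node."""
    if env[r] == env["curr"]:
        raise StuckError("r and curr point to the same heap-node")
    tag, data, _ = heap[env[r]]
    heap[env[r]] = (tag, data, target)


def read_output(heap, start):
    """Follow next-pointers from start; return None if the heap is cyclic."""
    out, seen, node = [], set(), start
    while node is not None:
        if node in seen:
            return None  # a cycle means the output is undefined
        seen.add(node)
        tag, data, nxt = heap[node]
        out.append((tag, data))
        node = nxt
    return out
```

In this sketch `env` plays the role of the valuation restricted to reference variables, and an undefined output is modeled by returning None.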

3.3 Streaming transducers with ε-transitions

We extend the model of streaming data-string transducers by allowing the transducer
to update its state, data variables, and data string variables using an
ε-transition that does not consume an input symbol. We will first show that it is
possible to eliminate such ε-transitions, and then we will translate list-processing
programs to transducers with ε-transitions.

The definition of a (deterministic) streaming data-string transducer $S$ with
ε-transitions extends the definition of SDSTs as follows: in a transition $(q, \sigma, \varphi, q', \alpha)$,
$\sigma$ can now also be $\varepsilon$, provided there is no transition of the form $(q, \sigma', \varphi', q'', \alpha')$
with $\sigma' \in \Sigma$. The restriction is needed for ensuring determinism: in a state $q$,
either all outgoing transitions are ε-transitions, or all outgoing transitions have
non-ε tags (and thus consume the next input symbol). Note that the original
determinism requirement still applies: if there are multiple transitions with the same
source state and the same input tag (which now may be $\varepsilon$), their tests must be
mutually exclusive.

As in the case of SDSTs, a configuration consists of a state $q$ and a valuation $\beta$
for the data and data string variables. The definition of the transition relation
$(q, \beta) \xrightarrow{(\sigma,d)} (q', \beta')$, for $\sigma \in \Sigma$ and $d \in D$, is unchanged. The ε-transitions are
defined by: if there exists a transition $(q, \varepsilon, \varphi, q', \alpha)$ such that $\beta(\varphi) = 1$ then
$(q, \beta) \xrightarrow{\varepsilon} (q', \beta \cdot \alpha)$. A run over the input data string $w$ is obtained by starting
in the initial configuration $(q_0, \beta_0)$, and applying either an ε-transition or a
transition corresponding to the next input data symbol until all the input data
symbols are consumed and no more ε-transitions are possible. This ensures determinism:
for a given input string $w$, there is at most one configuration $(q, \beta)$
such that (1) $(q_0, \beta_0) \xrightarrow{w} (q, \beta)$ and (2) $(q, \beta)$ has no ε-successor. The semantics
$[\![S]\!](w)$ is defined to be $\beta(O(q))$ in such a case, provided $O(q)$ is defined, and is
undefined otherwise. Note that it is possible that such a transducer keeps on executing
ε-transitions without terminating, and in such a case, the corresponding
output is undefined.

It turns out this extension does not add to the expressiveness: 

Proposition 5. Given a streaming data-string transducer $S$ with ε-transitions,
one can effectively construct a streaming data-string transducer (without ε-transitions)
$S'$ such that $[\![S]\!] = [\![S']\!]$ with the same number of states, the same number of data
variables, and the same number of data string variables.



3.4 From single-pass programs to streaming transducers

In this section, we describe how to translate single-pass list processing programs 
to streaming data-string transducers. The first step is to view the semantics of 
a single-pass list processing program as a partial function from data strings to 
data strings. To associate such a data string transduction $[\![P]\!]$ with a program
$P$, we encode input parameters in the same manner as described in Sec. 2.1.
If $P$ has $k_i$ input boolean/data/tag variables and $k_o$ output boolean/data/tag
variables, then we prefix the input data string with $k_i$ symbols each encoding
one input argument, and prefix the output data string with $k_o$ symbols each
encoding one output value.

The main challenge in the construction is to store the information in the 
data heap used by the program P using a bounded number of data and data 
string variables in the corresponding transducer $S$. Figure 2 shows a possible
configuration of the data heap that the program accesses using the reference
variables curr and $r_1, r_2, r_3, r_4$. The first observation is that the heap-nodes that
are not accessible from any of the reference variables are not relevant to the
execution of the program, and can be ignored. Second, nodes such as $n_9$ and
$n_{10}$ that are accessible from curr.next can contain only input symbols that the
program has not processed so far. These nodes have not influenced the execution
of the program so far, and information in these nodes does not need to be
stored. When the program executes the statement curr := curr.next, the node
$n_9$ becomes relevant. This step is analogous to the transducer $S$ processing the
next input symbol.

Compressing the heap using a bounded number of strings is achieved using an
encoding similar to [14]. A node is called a referenced node if a reference variable
points to it. In the example, $n_0$, $n_6$, $n_7$ and $n_8$ are referenced nodes. Information
in such nodes needs to be stored explicitly by $S$. For each reference variable r of
$P$, $S$ maintains a data variable $d_r$ and a tag variable $t_r$ storing the information
in the node that r points to. A node such as $n_3$ is called an interruption node as
two nodes point to it (and both these nodes are accessible from the program's
reference variables). If $P$ has $k$ reference variables, then there can be at most
$2k - 1$ interruption nodes. The stretches $n_1, n_2$ and $n_3, n_4, n_5$ are uninterrupted
heap segments. In each such segment (1) the first node is either an interruption
node, or is the next-successor of a referenced node, (2) the next-pointer of each
node in the sequence points to the next node in the sequence, (3) no node other
than the first is an interruption or a referenced node, and (4) the next-pointer of
the last node is either nil or points to an interruption or a referenced node. If $P$
has $k$ reference variables, then there can be at most $2k - 1$ uninterrupted heap
segments. The sequence of data symbols stored in an uninterrupted heap segment
is stored in a data string variable by $S$. In our example, the data string $w_1$ stores
the data symbols in $n_1, n_2$ and the data string $w_2$ stores the data symbols in
$n_3, n_4, n_5$. The finite-state control of $S$ remembers the shape of the heap: $r_1$ and
$r_2$ point to the same node, the next-pointer of the $r_1$-referenced node points to
the data string stored in $w_1$, the next-pointers of the node referenced by $r_3$ and
of the last node of $w_1$ point to $w_2$, $w_2$ is followed by the $r_4$-referenced node,
which is followed by the curr-referenced node. Such a shape can be captured by
a function $f_n : Y \to Y_\bot$, where $Y$ contains all the reference variables of $P$ and all
the data string variables of $S$ that store the data strings in uninterrupted heap
segments. If $P$ executes the assignment $r_3$.next := curr, then $n_3$ is no longer
an interruption node, and in this case, the two uninterrupted heap segments
collapse into one. This is achieved by $S$ by the assignment $[w_1, w_2] := [w_1 . w_2, \varepsilon]$
to the data string variables, updating the tag/data variable corresponding to $r_3$,
and changing the shape by updating $f_n(w_1)$ to $r_4$ and $f_n(r_3)$ to curr.

Proposition 6. Given a single-pass list-processing program $P$ one can effectively
construct a streaming data-string transducer $S$ with ε-transitions such that
$[\![P]\!] = [\![S]\!]$. If $P$ has $m$ locations, $k_r$ reference variables, $k_b$ boolean variables, $k_t$
tag variables, and $k_d$ data variables, then $S$ has $k_d + k_r$ data variables, $2k_r$ data
string variables, and $O(m \cdot 2^{k_b} \cdot k_r^{\ldots} \cdot |\Sigma|^{k_t + k_r})$ states.

Proof. The transducer $S$ has a data variable for each data variable of $P$, and also
for each reference variable of $P$ (to store the data values in referenced nodes in
the heap). It has $2k_r$ data string variables to store uninterrupted heap segments.
The state of S stores (1) the location of control of P, (2) the boolean value of 
each boolean variable of P, (3) the tag value of each tag variable of P, (4) a 
partition of the reference variables of P (two reference variables are in the same 
partition if they point to the same heap-node), (5) for each equivalence class 
in the partition, either a tag value stored at the node referenced, or nil, and 
(6) for each data string variable and each equivalence class of the partition, a 
next value that gives either a data string variable or an equivalence class of the 
partition. The last component stores the shape of the heap. The bound on the 
possible number of states follows by a simple counting argument. 

If $P$ has data, boolean, or tag input variables, the transducer $S$ first scans
the initial prefix of data symbols setting up the initial values using ε-transitions.
Then the transducer processes the first input symbol, setting the tag-component
corresponding to curr in its state to the tag of the input symbol, and setting the
data variable corresponding to curr to the data of the input symbol. After this
phase, the control location is set to the unique entry location, all output strings
are empty (there are no uninterrupted heap segments in the initial heap), and
the partition has only two classes (some reference variables are nil and some
are in the class that contains curr).

We now describe transitions of $S$ corresponding to statements of $P$. The only
statement that causes a non-ε transition is the statement curr := curr.next. The
transition that corresponds to this statement in $S$: (1) changes the stored control
location of $P$, (2) changes the partition of reference variables into equivalence
classes: curr is split from its current equivalence class, (3) stores a new tag value
for the new equivalence class of curr, (4) for each reference variable p
from the previous equivalence class of curr, sets the new next value of p (i.e. $f_n(p)$)
to curr, (5) for any data string variable x (that stores an uninterrupted
heap segment) of $S$ with $f_n(x) = \mathit{curr}$, appends the current data
symbol (from before the transition is executed) to x. All other statements are
captured by an ε-transition, as they do not correspond to a move of the head
of the automaton. Boolean, tag, and data assignments can be simulated directly.
We have already described using an example how a statement r.next := curr can
affect the shape of the heap, and how this is captured by assignments that $S$
can perform in its transitions.

In Section 2.5, we have mentioned that the assignments that a streaming 
transducer performs on its data string variables can be executed by only chang- 
ing a constant number of pointers. A list-processing program equivalent to a 
transducer stores data strings in list segments on the heap, and keeps pointers 
to the first and last nodes of the segment. To perform an assignment of the form 
x := xy, the program performs commands xl.next := yf; xl := yl, where xf (xl)
is a reference to the first (last) node representing x (and similarly for y). We
obtain the following proposition: 

Proposition 7. Given a streaming data string transducer $S$, one can effectively
construct a single-pass list-processing program $P$ such that $[\![S]\!] = [\![P]\!]$.
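
The constant-time concatenation x := xy used in the argument above can be sketched as follows. This is only an illustration: the `Node` and `Segment` classes and the treatment of empty strings (which the text does not discuss) are assumptions of the sketch, not part of the paper's construction.

```python
class Node:
    def __init__(self, tag, data):
        self.tag, self.data, self.next = tag, data, None


class Segment:
    """A data string stored as a list segment with first and last pointers."""

    def __init__(self, symbols=()):
        self.first = self.last = None
        for tag, data in symbols:
            node = Node(tag, data)
            if self.first is None:
                self.first = node
            else:
                self.last.next = node
            self.last = node


def concat(x, y):
    """x := xy, changing a constant number of pointers; y becomes empty."""
    if y.first is None:        # assumed handling of an empty y
        return
    if x.first is None:        # assumed handling of an empty x
        x.first = y.first
    else:
        x.last.next = y.first  # xl.next := yf
    x.last = y.last            # xl := yl
    y.first = y.last = None    # y is consumed, so no segment is duplicated


def contents(seg):
    """Read the symbols of a segment from its first to its last node."""
    out, node = [], seg.first
    while node is not None:
        out.append((node.tag, node.data))
        if node is seg.last:
            break
        node = node.next
    return out
```

Emptying y after the concatenation mirrors the copyless restriction on transducer assignments: each list segment is referenced by at most one data string variable.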



4 Functional programs on lists 

We consider a simply typed functional language with types bool, tag, data, list, 
and function types, with letrec recursion and pair and list constructors. As in 
the previous section, we assume that $\Sigma = \Gamma$ for ease of presentation. The terms
are defined by: 

t ::= true | false | isTrue t | isFalse t
    | S | isS                  (for all S ∈ tag)
    | x | fun x:T.t | t t | if t then t else t
    | x = x | x < x | {t,t} | t.1 | t.2
    | nil | cons t t | append t t
    | isnil t | head t | tail t
    | let x=t in t | letrec x:T=t in t

The operators = and < apply to terms of type data. The lists are of type 
(list (tag × data)), or list for short. The list and pair terms are standard.

We define a class of functional programs that intuitively captures single-pass
functions that process a list by recursing through it from left to right.
The recursion restriction we define is a minor generalization of tail recursive 
functions, where we allow the caller to perform operations on, but not test, the 
values that the callee returns. This allows capturing common routines such as 
insert, delete, and reverse (both tail-recursive and non-tail recursive). We allow 
wrapper functions in order to enable the standard programming style for tail 
recursive functions. 

A function is a single-pass list processing function iff it is defined by the 
following term: 

fun a: list × T1.
  letrec f:((list × T1 × T2) -> (list × T)) = t
  in (f a initValues)

This term encodes a wrapper function that can pass some additional arguments 
to a recursive function f . The following conditions are required to hold: 

— one list: Let us denote the first argument to f (by definition of type list) by
l. The list l is the only list accessed by isnil, head and tail. The contents
of the other list variables are thus not examined. However, it is possible to
use cons or append with these variables.

— type restrictions: The types T1 and T are products of types bool, tag, data,
with possibly more than one component of each type. These are input and 
output arguments of the list processing function. The type T2 is a product 
of types bool, tag, data, and list, with possibly more than one component 
of each type. Intuitively, these are buffers where the recursive function can 
store results. 

— recursion restriction: The first argument to any recursive call is the term
(tail l), where l is the first argument to f. Every recursive call to f is
enclosed in an expression e defined by let r = (f (tail l) a) in t. Furthermore,
e is the last expression the caller evaluates, i.e. t is the caller's
result. Using the type restriction above, we have that $r = l_1, r_1, \ldots, r_n$, and
$a = a_1 \ldots a_n$, and $t = l_2, t_1, \ldots, t_n$, where $l_1$ is a list and $l_2$ is a list
expression, each of $r_1, \ldots, r_n$ and $t_1, \ldots, t_n$ is of type bool, tag or data, and
each of $a_1, \ldots, a_n$ is of type bool, tag, data, or list. We have the following
restrictions on these subexpressions: (i) a list variable of f can appear in at
most one expression $a_i$ or $t_j$ (this is similar to the restriction that requires
the assignments of SDSTs to be "copyless"), (ii) if $t_i$ is of type bool, then
the only variable from r it can use is $r_i$, (iii) if $t_i$ is of type tag or data,
then we have $t_i = r_i$, (iv) the only variable from r that $l_2$ can use is $l_1$.

— term t: The term t is of the form fun x:(list × T1 × T2). t', where t'
does not contain the function definition term fun or the recursive definition
term letrec.

— initValues: The expression initValues is of type T2. If T2 contains list 
types, the corresponding values in initValues are nil. 

Note that all the above conditions can be checked syntactically. 

4.1 Examples 

The data transduction Fi that reverses a list can be encoded as a single-pass 
list processing function as follows: 

fun l:list.
  letrec reverse: (list × list -> list)
    = fun l:list. fun result:list.
        if (isnil l) then result
        else (reverse (tail l) (cons (head l) result))
  in (reverse l nil)

A function that, given a list l and a data value d, deletes all occurrences of d
in l is encoded as follows:

fun l:list. d:data.
  letrec FuncDelete: (list × data -> list)
    = fun l:list. fun d:data.
        if (isnil l) then nil
        else
          if (head l).data = d
          then (FuncDelete (tail l) d)
          else (cons (head l)
                     (FuncDelete (tail l) d))
  in (FuncDelete l d)

One of the recursive calls in FuncDelete is enclosed in an expression that adds a
cell to the front of the list: (cons (head l) (FuncDelete (tail l) d)). This
recursive call is not tail recursive, as the caller function applies an operation 
to the result returned by the callee. However, the recursive call satisfies the 
recursion restriction from our definition of single-pass list processing functions 
(and it satisfies the other restrictions as well). 
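
For readers who want to run the two examples, the following is a direct transliteration into Python. It is only a sketch under assumptions: lists are modeled as tuples of (tag, data) pairs, and the function names `reverse` and `func_delete` are ours. Python's recursion limit makes it unsuitable for long inputs.

```python
def reverse(l, result=()):
    """Tail-recursive reversal: the accumulator plays the role of 'result'."""
    if not l:                          # isnil l
        return result
    return reverse(l[1:], (l[0],) + result)   # cons (head l) result


def func_delete(l, d):
    """Delete every symbol whose data value equals d."""
    if not l:                          # isnil l
        return ()
    head, tail = l[0], l[1:]
    if head[1] == d:                   # (head l).data = d
        return func_delete(tail, d)
    # Not tail recursive: the caller conses onto the callee's result,
    # but it never tests that result.
    return (head,) + func_delete(tail, d)
```

As in the functional originals, both functions inspect only the head of the list being traversed and recurse on its tail, which is what makes them single-pass.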



The constructions in this paper lead to an algorithm to check if the im- 
perative function Delete of Section 3 and the above function FuncDelete are 
semantically equivalent (i.e., specify the same transduction). Such a check can 
be used for full functional verification of one using the other as the specification. 



4.2 From list processing functions to streaming transducers 

Semantics. The simply-typed functional language we defined contains standard
terms. The values of the language, as well as the typing rules, and the evaluation
relation $t_1 \to t_2$ for all the terms are omitted here, as they can be found
in a standard textbook [16]. The operational semantics is given by a transition
system, whose states are subterms of f, whose transitions are given by the
evaluation relation $t_1 \to t_2$, and whose initial state is the term f.

As for imperative programs, the semantics of a single-pass list processing
function can be viewed as a data string transduction, that is, a partial function
from data strings to data strings. To associate a transduction $[\![f]\!]$ with a single-pass
list processing function f : (list × T1) -> T, we encode input parameters in
the same way as in Section 3. Given a data string $w$, its first $k_i$ symbols can represent
$k_i$ input boolean/data/tag variables, and the rest represents the input list
(converting a data string to a list term of type list is straightforward). Given a
data string, the function $\mathit{decode\_param}(w, k_i)$ returns the parameter values, and
the function $\mathit{decode\_list}(w, k_i)$ returns the tail of the input string represented as a
list term. A similar encoding to data strings can be used for output values and output
lists. Given a tuple $\{l_1, r_1, \ldots, r_n\}$ of type list × T, $\mathit{enc\_res}(\{l_1, r_1, \ldots, r_n\})$
returns the corresponding data string. Given a single-pass list processing function
f, we have that $[\![f]\!](w) = w'$ iff $(f\ \mathit{decode\_list}(w)\ \mathit{decode\_param}(w)) \to^* \{l_1, r_1, \ldots, r_n\}$
and $\mathit{enc\_res}(\{l_1, r_1, \ldots, r_n\}) = w'$.

Proposition 8. Given a single-pass list processing function f : (list × T1) -> T,
one can effectively construct a streaming data string transducer $S$, such that
$[\![f]\!] = [\![S]\!]$. Let g be the recursive function used by f. If g has $k_b$ boolean variables,
$k_t$ tag variables, $k_d$ data variables, and $k_l$ list variables, then $S$ has $O(2^{k_b} \cdot |\Sigma|^{k_t})$
states, $k_d$ data variables, and $k_l + 1$ data string variables.

We note that the construction used in the proof of Proposition 8 is more di- 
rect than the one used in the proof of Proposition 6. The list variables (apart 
from the list that is traversed) and data variables are modeled directly by data 
string variables and data variables of the transducer, and the control state of 
the transducer encodes the value of boolean and tag variables. 

Given an SDST S, we can construct an equivalent list-processing program f . 
We first describe the arguments of the recursive function g the function f uses: 
its boolean arguments encode state of S, its data arguments correspond to data 
variables of S, and its list arguments correspond to the data string variables of 
$S$. A transition $(q, \sigma, \varphi, q', \alpha)$ is translated by making the function g test whether
the current boolean arguments correspond to $q$, whether the current tag is $\sigma$,
and whether $\varphi$ holds for the current data arguments. If so, the function makes
a recursive call with parameters encoding $q'$ and the assignments from $\alpha$. The
function we obtain in this way is tail recursive. The next proposition follows:

Proposition 9. Given a single-pass data string transducer $S$, one can effectively
construct a single-pass list processing function f, such that $[\![S]\!] = [\![f]\!]$.

5 Decision Problems 

In this section, we prove that the equivalence problem and the pre/post condition 
checking problem are decidable for streaming data string transducers. We also 
show that, for a number of extensions of the streaming transducer model, already 
the basic analysis problem of reachability is undecidable. 

5.1 Sound and complete abstraction for order and equality 

In proofs of decidability of equivalence and pre/post condition checking of SDSTs 
that operate on an infinite data domain $D$, we will construct finite state systems
that do not store values of the data variables of the SDSTs, but only keep track of 
order and equality predicates. In order to prove that such an abstraction is both 
sound and complete for analysis problems, we will need the lemma presented in 
this subsection. 

Let $V$ be a set of variables that range over $D$. We fix $V$ and an infinite $D$ for
this subsection. We will consider pairs of the form $(V^d, \rho)$, where the set $V^d \subseteq V$
represents the set of variables with a defined value, and where $\rho$ is an ec-order
on $V^d$ (short for order on equivalence classes). An ec-order $\rho = (=_\rho, <_\rho)$ is a
pair where the first component is an equivalence relation on $V^d$, and the second
component is a strict total order on equivalence classes of $=_\rho$. For data variables
$v_1, v_2$, we write $v_1 <_\rho v_2$ if $v_1$ belongs to an equivalence class $c_1$, $v_2$ belongs
to an equivalence class $c_2$, and $c_1 <_\rho c_2$. For example, if $V = \{v_1, v_2, v_3\}$ and all
variables have a defined value, then a possible ec-order on $V^d$ can be represented
as $v_1 =_\rho v_3 <_\rho v_2$. Let $\beta$ be a valuation of data variables as in the definition
of SDSTs. A pair $(V^d, \rho)$ represents a set of valuations. We write $\beta \models (V^d, \rho)$
iff $\beta$ is defined precisely for the variables in $V^d$, and for all $v_1, v_2 \in V^d$ we
have that $\beta(v_1) < \beta(v_2)$ iff $v_1 <_\rho v_2$, and $\beta(v_1) = \beta(v_2)$ iff $v_1 =_\rho v_2$. Let $\varphi$
be a Boolean combination of constraints of the form $v_1 < v_2$ and $v_1 = v_2$ for
variables $v_1, v_2 \in V$. Let $\alpha$ be a map from $V$ to $V$ modeling assignments, as in
the definition of transitions of SDSTs. We write $\beta_1 \xrightarrow{\varphi,\alpha} \beta_2$ if $\beta_1$ satisfies $\varphi$
and $\beta_2 = \beta_1 \cdot \alpha$, similarly to the definition of SDSTs. For pairs $(V^d, \rho)$, we define
a transition relation $(V^d, \rho) \xrightarrow{\varphi,\alpha} (V'^d, \rho')$ iff (a) $V'^d$ contains the variables
which were assigned to by $\alpha$ from variables in $V^d$, (b) $\rho$ implies $\varphi$, and (c) $\rho'$
is the ec-order obtained from $\rho$ by executing $\alpha$. Let $u^a$ be a sequence of pairs
$(V^d_1, \rho_1)(V^d_2, \rho_2) \cdots (V^d_n, \rho_n)$. Let $u$ be a sequence of valuations $\beta_1\beta_2 \cdots \beta_n$. Let
upd be a sequence of pairs $(\varphi_1, \alpha_1)(\varphi_2, \alpha_2) \cdots (\varphi_{n-1}, \alpha_{n-1})$. The sequence $u^a$
conforms to the sequence upd if for all $i$, if $1 \le i < n$, then
$(V^d_i, \rho_i) \xrightarrow{\varphi_i,\alpha_i} (V^d_{i+1}, \rho_{i+1})$. Similarly, the sequence $u$ conforms to the sequence upd if for all $i$,
if $1 \le i < n$, then $\beta_i \xrightarrow{\varphi_i,\alpha_i} \beta_{i+1}$.

The proof of the following lemma crucially uses the fact that the infinite 
totally-ordered data domain D contains chains, that is, sequences of elements 
in an increasing order, of unbounded length. The proof is omitted here in the 
interest of space. A similar proof is a part of the proof of Theorem 1 of [2] . 

Lemma 1. Let upd be a sequence of pairs $(\varphi_1,\alpha_1)(\varphi_2,\alpha_2) \cdots (\varphi_{n-1},\alpha_{n-1})$.
Let $u^a$ be a sequence of pairs $(V^d_1,\rho_1)(V^d_2,\rho_2) \cdots (V^d_n,\rho_n)$,
such that $u^a$ conforms to upd. Then there exists a sequence of valuations
$u = \beta_1\beta_2 \cdots \beta_n$ such that $u$ conforms to upd and for all $i$, if $1 \le i \le n$,
then $u_i \models u^a_i$.
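
The satisfaction relation between a valuation and an ec-order, on which Lemma 1 relies, can be made concrete with a short sketch. It does not reproduce the proof of the lemma; it only checks a single valuation against a single ec-order. The representation (a list of sets of variable names, listed in increasing order) and the names `models` and `canonical_valuation` are assumptions, and integers stand in for the data domain.

```python
def models(beta, classes):
    """Check that beta satisfies the ec-order given by 'classes'.

    beta    : dict mapping exactly the defined variables to data values
    classes : list of disjoint sets of variables, smallest class first
    """
    rank = {v: i for i, cls in enumerate(classes) for v in cls}
    if set(beta) != set(rank):      # beta must be defined precisely on these
        return False
    for u in rank:
        for v in rank:
            if (beta[u] < beta[v]) != (rank[u] < rank[v]):
                return False
            if (beta[u] == beta[v]) != (rank[u] == rank[v]):
                return False
    return True


def canonical_valuation(classes):
    """One valuation satisfying the ec-order: use the class index as value."""
    return {v: i for i, cls in enumerate(classes) for v in cls}
```

A single ec-order is always satisfiable in this way. The content of the lemma is that a whole conforming sequence of ec-orders can be realized by one conforming sequence of valuations, which is where chains of unbounded length in the data domain are needed.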



5.2 Equivalence checking 

Given two streaming data-string transducers $S_1$ and $S_2$ from $\Sigma$ to $\Gamma$, the streaming
transducer equivalence problem is to determine whether $[\![S_1]\!] = [\![S_2]\!]$.

In order to show that the problem can be solved in Pspace we reduce the
problem to a reachability problem in 1-counter machines. A 1-counter machine $M$
is a tuple $(Q_M, \delta_M, q_0^M, F_M)$, where $Q_M$ is a set of states, $q_0^M$ is the initial state,
and $F_M \subseteq Q_M$ is a set of final states. The transition relation $\delta_M$ is a relation
in $Q_M \times Q_M \times \{-1, 0, 1\}$. Note that 1-counter machines do not test the content
of the counter. A configuration of the 1-counter machine is a pair in $Q_M \times \mathbb{Z}$,
that is, it consists of a state and the value of a counter. A transition relation
$\to$ on configurations is defined as follows: $(q,z) \to (q',z')$ iff $(q,q',c) \in \delta_M$ and
$z' = z + c$. The 1-counter 0-reachability problem is to decide whether there
exists a state $q \in F_M$ such that $(q_0^M, 0) \to^* (q, 0)$. This is a special case of the
empty-stack reachability problem for pushdown automata. While the latter is
Ptime-complete, the following lemma shows that the former is in Nlogspace.

Lemma 2. The 1-counter 0-reachability problem is in Nlogspace.

Proof. Consider a 1-counter machine $M = (Q_M, \delta_M, q_0^M, F_M)$ with $n = |Q_M|$ states. We observe that
for all $q, q' \in Q_M$, if there is a path $(q, 0) \to^* (q', 0)$, then there is such a path
with stack depth bounded by $n^2$. This is a consequence of a summarization-based
reachability algorithm (easily adapted from the summarization-based reachability
algorithm for pushdown automata), which computes summaries for pairs $(q, q')$.
The iteration in which a pair $(q, q')$ gets added is the minimum absolute value
of the counter needed to reach $(q', 0)$ from $(q, 0)$. The number of iterations is at
most the number of summaries, that is, $n^2$. Note that this observation holds for
all pushdown automata.

We can thus assume that the counter ranges over $(-n^2, n^2)$. A configuration of a
1-counter machine is a pair $(q, z)$, where $z$ is the value of the counter. Therefore we need to
consider only $O(n^3)$ possible configurations. (This statement does not hold for
general pushdown automata.) Thus our reachability problem is a reachability
problem in a graph with $O(n^3)$ vertices. The problem can therefore be solved in
space $O(\log n^3)$.
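
The bounded-counter argument suggests a simple decision procedure, sketched below. Note that this is a deterministic breadth-first search over the bounded configuration graph, not the nondeterministic logarithmic-space algorithm of the lemma. The function name `zero_reachable`, the encoding of the transition relation as a set of triples, and the bound of $n^2$ on the absolute counter value (taken from the proof above, with $n$ the number of states) are assumptions of the sketch.

```python
from collections import deque


def zero_reachable(states, delta, q0, final):
    """Decide whether some (q, 0) with q in 'final' is reachable from (q0, 0).

    delta is a set of triples (q, q2, c) with c in {-1, 0, 1}.
    The counter is never tested; it is only kept within [-n*n, n*n].
    """
    n = len(states)
    bound = n * n
    start = (q0, 0)
    seen = {start}
    queue = deque([start])
    while queue:
        q, z = queue.popleft()
        if z == 0 and q in final:
            return True
        for (p, p2, c) in delta:
            if p == q and abs(z + c) <= bound:
                nxt = (p2, z + c)
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
    return False
```

The search visits at most $n \cdot (2n^2 + 1)$ configurations, which matches the cubic size of the graph used in the proof.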

Theorem 1. The streaming data-string transducer equivalence problem is in Pspace.

Proof. Let us consider two streaming data-string transducers $S_1$ and $S_2$ from $\Sigma$
to $\Gamma$. They are not equivalent if there exists an input data string $w$ over $\Sigma$ such
that one of the following three conditions holds: (i) $[\![S_1]\!](w)$ is defined, but $[\![S_2]\!](w)$
is not (or vice versa), (ii) $[\![S_1]\!](w)$ and $[\![S_2]\!](w)$ are defined, but have different
lengths, (iii) $[\![S_1]\!](w)$ and $[\![S_2]\!](w)$ are defined and have the same length, but
there exists a position $p$ such that the data strings $[\![S_1]\!](w)$ and $[\![S_2]\!](w)$ differ
at the position $p$.
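
The three kinds of witness can be told apart mechanically once the two outputs for a fixed input are known. The sketch below is only an illustration of this case split and not of the counter-machine construction that follows: the function name `difference_kind`, the use of None for an undefined output, and the 0-based position are assumptions.

```python
def difference_kind(out1, out2):
    """Classify how two outputs for the same input differ.

    out1, out2 : list of (tag, data) pairs, or None if the output is undefined
    Returns None if the outputs agree.
    """
    if (out1 is None) != (out2 is None):
        return "definedness"             # condition (i)
    if out1 is None:
        return None                      # both undefined: no difference
    if len(out1) != len(out2):
        return "length"                  # condition (ii)
    for p, (s1, s2) in enumerate(zip(out1, out2)):
        if s1 != s2:
            return ("position", p)       # condition (iii), first such p
    return None
```

The decision procedure cannot enumerate inputs like this, since the data domain is infinite; this is why the proof guesses the differing position and checks it with a counter.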

We construct a 1-counter automaton $M$ and designate a state $q$ such that $q$ is
0-reachable in $M$ if and only if $S_1$ and $S_2$ are not equivalent. The automaton
$M$ nondeterministically chooses which type of difference (of the three described
above) it will find. We only describe here how $M$ can determine that there is an
input string such that the $p$-th output symbol of $S_1$ is different from the $p$-th
output symbol of $S_2$. The construction for the other two cases uses similar ideas
and is simpler.

The automaton M nondeterministically simulates S1 and S2 running in parallel.
It keeps track of the states of S1 and S2 precisely, but only keeps some
information on the data and data string variables. Intuitively, M guesses during
the course of the simulation of S1 (resp. S2) where the position p in the output
is, and uses its counter to check that the guess is the same for S1 and S2.

For each data string variable, M guesses (at each step) where the contents
of the variable will appear in the output with respect to the position p. More
concretely, for each data string variable x of both S1 and S2, M guesses which
of the following categories the variable is in: (i) left of p (Class L), (ii) center,
i.e. position p is in this string (Class C), (iii) right of p (Class R), (iv) x does
not contribute to the output (Class N).

Maintaining consistency of the assignment of data string variables to these four
classes is straightforward. First, consider the case when at a particular step, S1
performs an assignment y := (a, v1) z (b, v2) and M guesses that the contents of
y will appear to the left of the position p in the output of S1. To verify that
this guess is consistent with previous guesses, M checks that in the previous
step, z was in Class L. The assignment caused two output symbols (a, v1) and
(b, v2) to appear to the left of the position p, therefore M increases its counter
by 2 (outputs of S2 are taken into account by decreasing the counter rather
than increasing it). Second, if at a particular step, S1 performs an assignment
x := (a, v1) y (b, v2) z, and M guesses that the symbol (b, v2) in this assignment
will be at the position p, then: (i) M verifies that at the previous step, y was in
Class L, and z was in Class R, (ii) M increases its counter by two in order to
simulate the fact that S1 outputs (a, v1) and (b, v2) (as before, when simulating
S2, M decreases the counter), and (iii) M assigns x to Class C. Note that initially,
no variable is assigned to Class C, and at each step, at most one variable is
in Class C, because of the copyless assignment restriction. The cases when M
guesses that a variable to which S1 (resp. S2) assigns is in Class R or Class N
are similar.
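The consistency check and counter update for one assignment can be sketched as follows. The encoding is hypothetical: a right-hand side is a list whose items are either variable names (strings) or new output symbols (tuples), and check_assignment is not a name from the paper. The text spells out only the Class L case and the Class C case with a fresh symbol at position p; the treatment of Class R, Class N, and of a Class C variable on the right-hand side is an assumption about what "similar" means.

```python
def check_assignment(classes, rhs, guess, sign, marked=None):
    """Return the counter change for one assignment, or None if inconsistent.

    classes: dict variable -> "L", "C", "R" or "N" (previous step).
    rhs:     list of items; str = data string variable, tuple = new symbol.
    guess:   class guessed for the assigned variable.
    sign:    +1 when simulating S1, -1 when simulating S2.
    marked:  index in rhs of the symbol guessed to be at position p
             (only used when guess == "C" and p falls on a new symbol).
    """
    def is_var(item):
        return isinstance(item, str)

    if guess in ("L", "R", "N"):
        # Every variable on the right-hand side must share the guessed class.
        if any(is_var(it) and classes[it] != guess for it in rhs):
            return None
        symbols = sum(1 for it in rhs if not is_var(it))
        return sign * symbols if guess == "L" else 0

    # guess == "C": find the item that contains position p.
    if marked is None:
        centers = [i for i, it in enumerate(rhs)
                   if is_var(it) and classes[it] == "C"]
        if len(centers) != 1:
            return None
        pivot = centers[0]
    else:
        if is_var(rhs[marked]):
            return None
        pivot = marked

    counted = 0
    for i, it in enumerate(rhs):
        if is_var(it):
            if i == pivot:
                continue
            if classes[it] != ("L" if i < pivot else "R"):
                return None
        elif i <= pivot:
            counted += 1  # as in the text, the symbol at p itself is counted
    return sign * counted
```

For the two assignments discussed above, the sketch yields a counter change of 2 when simulating S1 and of -2 when simulating S2.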

In the remainder of the proof, we assume that the data domain D is infinite. 
If it is finite, the automaton can directly store values from D in its finite-state 
control, and the construction is simpler. 

For data variables, M keeps track of which variables are defined, and for the
defined variables, it keeps track of the ordering and equality information. More
precisely, let us consider the following set of variables: V = (V1 \ {curr^1}) ∪ (V2 \
{curr^2}) ∪ {vp1, vp2, curr}, where V1 and V2 are the sets of data variables of S1
and S2, and vp1 and vp2 are used by M to store information about the data
values that S1 and S2 output at position p. The automaton M stores a pair (V', ρ),
where V' ⊆ V contains all of the variables whose values are defined in the
computation of S1 and S2, and ρ is an ec-order on V'. The pair (V', ρ) is updated
as steps of S1 and S2 are simulated and their transitions are executed. Note that
M maintains only one variable curr common to S1 and S2 because the two
transducers are running on the same input. The final part of the construction is
the maintenance of vp1, the variable used to store the output of S1 at position
p. When M guesses that the output symbol of S1 at position p will be one in the
right-hand side of the assignment (such as x := (a, v1) y (b, v2) z) it is simulating
currently, it assigns x to Class C as above, and if it guesses that at position p
is the symbol (b, v2), then: (i) the value b from Γ is stored in the finite-state
control of M, and (ii) vp1 is added to V', the set of defined variables, and vp1 is
added to the equivalence class of v2 in ρ. The construction for vp2 is analogous.

To summarize, a state of M consists of (1) a state of S1, (2) a state of S2, (3)
a set V' ⊆ V representing the defined variables, (4) an ec-order ρ, (5) a partition
of the data string variables of S1 and S2 into classes (as described above) and a
symbol from Γ at position p for S1 and S2 (if M guessed that the output to
position p was already performed). The set of states of M is thus a product
Q1 × Q2 × 2^V × P × Q_B, where the five components correspond to items (1) to
(5). The initial state of M is the tuple containing the initial states of S1 and S2,
with the set V' empty (all the variables are undefined), and with the component
(5) having a special value ⊥. From this state there are nondeterministic transitions
which choose the initial assignments of data string variables to classes. The other
transitions are as described above. The set of final states consists of states where
either the variables vp1 and vp2 are defined but ρ does not imply vp1 = vp2, or
the Γ symbols stored in the finite-state control of M for position p in the output
strings of S1 and S2 differ.

We now prove that a final state of M is 0-reachable iff there exists an input
data string w and a position p such that ⟦S1⟧(w) and ⟦S2⟧(w) differ at position
p. We will need the following notion that relates configurations of M to
configurations of S1 and S2. Let c1 = (q1, β1) be a configuration of S1, let
c2 = (q2, β2) be a configuration of S2, and let c_M = ((q1, q2, V', ρ, q_B), e) be a
configuration of M (e is the value of the counter). The configuration c_M is an
abstraction of the configurations (c1, c2) (denoted by α((c1, c2)) = c_M) iff the
following conditions hold:

— States: the states of S1 and S2 are the same in c1 and c2 as they are in c_M.

— Data variables: β1 or β2 is defined for each of the variables in V', and the
values β1 and β2 assign to variables in V' are consistent with ρ.

— Data string variables: Let e_L^1 be the number of symbols in the data string
variables of S1 that are assigned to Class L in q_B. Class C in q_B contains
by construction at most one data string variable of S1. If Class C contains
a data string variable x1 of S1, then we can designate a position p1 in the
data string in x1. Let e_C^1 be the number of characters to the left of p1 in
x1. The values e_L^2 and e_C^2 are defined analogously for S2 and a position
p2 in a data string variable x2. The following equality is required to hold:
e_L^1 + e_C^1 − (e_L^2 + e_C^2) = e, where e is the counter value in c_M. Furthermore,
let d1 be the data value at position p1. We have that the equality and order
relations that ρ contains on vp1 and the other data variables hold for d1 and
the values of these data variables given by β1 and β2. An analogous condition
holds for the data value at position p2.
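The counter condition on data string variables can be checked on concrete configurations as in the sketch below. All names are hypothetical: contents maps each data string variable to its current string, classes maps it to its class, and offset is the number of characters to the left of the designated position inside the Class C variable (ignored when no variable is in Class C).

```python
def counter_value(contents1, classes1, offset1,
                  contents2, classes2, offset2):
    """Value e required by the abstraction: (e_L + e_C) for S1 minus
    (e_L + e_C) for S2."""
    def side(contents, classes, offset):
        # e_L: total length of the variables guessed to lie left of p.
        e_left = sum(len(contents[x]) for x in contents if classes[x] == "L")
        # e_C: characters left of the designated position, if one exists.
        has_center = any(c == "C" for c in classes.values())
        e_center = offset if has_center else 0
        return e_left + e_center

    return (side(contents1, classes1, offset1)
            - side(contents2, classes2, offset2))
```

A value of 0 means that the designated positions p1 and p2 are preceded by the same number of output symbols in both transducers.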

Claim 1. The automaton M can reach the configuration c_M = ((q1, q2, V', ρ, q_B), e)
in k steps iff there exists an input string w of length k such that S1 (S2), after
traversing this input, reaches a configuration c1 (c2), and α((c1, c2)) = c_M.

The claim is proven by induction on k. The more difficult part of the proof of
the claim is the left-to-right implication, where we are required to find an input
string w that satisfies the condition. We need to find a sequence of valuations
β that is consistent with the sequence of pairs (V', ρ) given by the sequence of
configurations of M. It is here that Lemma 1 is used.

Using Claim 1, we now prove that if a final state of M is 0-reachable, then
there exists an input data string w and a position p such that ⟦S1⟧(w) and
⟦S2⟧(w) differ at position p. A final state of M is a state where we do not have
vp1 = vp2 or where the Γ symbols stored for positions p1 and p2 differ. By Claim
1, this means that there is a position p1 in the output of S1 and a position p2 in
the output of S2 where the data values or the Γ symbols differ. If such a state is
0-reachable, (using Claim 1) we get that e_L^1 + e_C^1 − (e_L^2 + e_C^2) = 0, which
implies p1 − p2 = 0, that is, p1 = p2. The other implication can also be easily
shown using Claim 1.

Complexity. Checking whether a particular final state of M is 0-reachable can
be done in Nlogspace (Lemma 2). A nondeterministic algorithm first guesses
which final state is reachable, and then checks its reachability in Nlogspace.
The number of states of the 1-counter automaton M we constructed is linear
in the number of states of S1 and S2 and exponential in the number of data
string and data variables of S1 and S2. Furthermore, given two states of M, one
can decide (in time polynomial in the number of variables) whether there is a
transition between the two states. The logarithm of the number of states of M
is therefore polynomial in the size of S1 and S2, and M does not have to be
constructed explicitly. We thus have that the streaming transducer
equivalence problem is in Pspace.



Theorem 1 implies that checking equivalence is in Pspace for list-processing 
programs from Section 3 and list-processing functions defined in Section 4. The 
reason is that the number of data and output variables of the resulting transducer 
is linear in the size of the program (more precisely, in the number of data and 
reference variables of the program).



5.3 Checking pre/post conditions and assertions 

Let S be a streaming data string transducer from Σ to Γ. Let A1 be a streaming
data string acceptor on Σ, and let A2 be a streaming data string acceptor
on Γ. The triple {A1} S {A2} holds iff for all input data strings w over Σ we
have that if A1 accepts w and ⟦S⟧(w) = w', then A2 accepts w'. The pre/post
condition problem for SDSTs is to determine, given A1, S, and A2, whether
{A1} S {A2} holds. Pre/post condition checking is useful in the context of
verification, because we can, for example, ask whether a transducer that takes a
sorted list (with respect to an ordering on Σ) as an input returns a sorted list (with
respect to an ordering on Γ) as an output. The upper bound in the following
theorem is obtained by reduction to the emptiness problem in nondeterministic
finite automata (NFAs).

Theorem 2. The pre/post condition problem for SDSTs is in PSPACE. 

The above definition of pre/post condition checking corresponds to partial
correctness. We can also check total correctness: there is a Pspace algorithm to
check, given S, A1, and A2, whether for all input strings w accepted
by A1, ⟦S⟧(w) is defined and A2 accepts ⟦S⟧(w).
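The difference between the partial and total readings of a triple can be made concrete with a bounded test harness. This is a sketch for experimentation only and not the Pspace procedure: it enumerates inputs up to a fixed length over a small set of values, models acceptors as Python predicates and the transducer as a function returning None when the output is undefined, and ignores tags. The names find_counterexample, is_sorted, and insert_sorted are hypothetical.

```python
from itertools import product


def find_counterexample(pre, transducer, post, values, max_len, total=False):
    """Return an input violating the triple {pre} transducer {post}, or None.

    With total=True, an undefined output on an accepted input also counts
    as a violation (total correctness)."""
    for n in range(max_len + 1):
        for candidate in product(values, repeat=n):
            w = list(candidate)
            if not pre(w):
                continue
            out = transducer(w)
            if out is None:
                if total:
                    return w
                continue
            if not post(out):
                return w
    return None


def is_sorted(w):
    return all(a <= b for a, b in zip(w, w[1:]))


def insert_sorted(d):
    """Single-pass insertion of d before the first value that is >= d."""
    def run(w):
        out, placed = [], False
        for v in w:
            if not placed and d <= v:
                out.append(d)
                placed = True
            out.append(v)
        if not placed:
            out.append(d)
        return out
    return run
```

With the sortedness precondition, no counterexample is found on the enumerated inputs; dropping the precondition exposes inputs on which the output is not sorted.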

The constructions discussed so far can also be used to solve a number of
assertion checking problems for single-pass list-processing programs.

Reachability. Given a single-pass list-processing program P, a location ℓ of P,
and a streaming data string acceptor A, is there a data string w accepted
by A such that starting from the initial heap that stores w, there is an
execution of P leading to a configuration with location ℓ? For this, we need
to construct the SDST corresponding to P as discussed in Section 3.4, and
simulate it on an input together with A. The complexity is Pspace. The same
construction can be used if additional constraints are specified on boolean
and tag variables of P at the end of the execution.

Pointer analysis. Given a single-pass list-processing program P, two pointer
variables x and y, and a streaming data string acceptor A, is there a data
string w accepted by A such that starting from the initial heap that stores
w, there is an execution of P leading to a configuration in which both x and
y point to the same heap-node? Recall that the compilation of programs
into SDSTs keeps track of such aliasing relationships, and has the necessary
information to answer such a query. We can also check if a pointer variable
r is guaranteed to be non-null whenever it is dereferenced (using expressions
such as r.next and r.data).

Heap-cycles detection. Given a single-pass list-processing program P and a
streaming data string acceptor A, is there a data string w accepted by A
such that starting from the initial (acyclic) heap that stores w, there is
an execution of P leading to a configuration in which the heap contains a
cycle (formed by next-pointers of heap-nodes)? Again, the compilation of
programs into SDSTs keeps track of the heap shape, and can be used to
solve this problem in Pspace.

5.4 Undecidable extensions 

Two-way data string transducers. A two-way (deterministic) data string
transducer (2DST) is an extension of the streaming data string transducer model,
where at each step, the transducer can decide whether to move left or to move
right or to stay put. More precisely, a transition of a 2DST is defined by a tuple
(q, σ, φ, q', α, ζ), where q, σ, φ, q' and α are as for SDSTs, and ζ is in {←, ↓, →}.
For 2DSTs, we assume that the input data string is enclosed by two special
symbols ⊢, ⊣. The machine stops when it reaches a final state. If the machine
never reaches a final state, or if it tries to move left while the tag is ⊢ or it tries
to move right when the tag is ⊣, then the output is undefined. Given a 2DST S
and a state q of S, the 2DST reachability problem is to determine whether there
exists a data string w such that S enters the state q while processing w.

Theorem 3. The 2DST reachability problem is undecidable. 

The theorem is proven by reduction from the undecidable problem of emptiness 
for two-counter automata. The main step of the proof is to show that a 2DST can 
recognize whether the input data string encodes a computation of a two-counter 
machine. The proof uses the fact that the data domain is ordered. 

Programs with multiple traversing pointers. The class of imperative
list-processing programs considered in Section 3 restricts how next pointers of heap
nodes can be traversed: there is one special pointer variable curr, and it is the only
pointer variable that can traverse the next pointer. Now consider the class of
programs, denoted by PMTP (short for programs with multiple traversal
pointers), obtained by lifting this restriction, and allowing assignments x := y.next
for any two pointer variables x and y. Given a program P from the class PMTP
and a location ℓ, the PMTP reachability problem is to determine whether there
exists a data string w such that starting from the initial heap that stores w,
there is an execution of P leading to a configuration with location ℓ.

Theorem 4. The PMTP reachability problem is undecidable. 

The proof of the undecidability is again by a reduction from the reachability 
problem for two-counter automata. The basic observation is that if multiple 
pointers can traverse the heap simultaneously, the program can check whether 
two successive parts of the heap encode two successive configurations of the 
two-counter machine. 



Data string variable equality. While a number of analysis problems for SDSTs,
and assertion checking problems for single-pass list-processing programs, are
decidable, checking whether the transducer/program can reach a configuration
where the contents of two string variables are the same is undecidable. Given
an SDST S, a state q of S, and two data string variables x and y of S, the
data string variable equality problem is to determine whether there exists a data
string w such that S reaches a configuration where x = y and the state is q.
The following theorem is proven using a reduction from Post's correspondence
problem.

Theorem 5. The data string variable equality problem is undecidable. 



6 Related Work 

We are not aware of any prior decidability results for checking semantic equiv- 
alence of list processing programs, even for the restricted case of bounded data 
domains. 

The decidability of safety properties for programs with lists was investigated 
in [7]. The negative result in [7] holds for a very restricted class of programs: 
programs with only non-nested loops which do not modify the list data structure. 
Compared to the model of [7], we do not allow general traversal assignments of 
the form x := y.next, but allow only one pointer variable curr to traverse the
next pointers of the heap nodes. We also assume that the initial heap is acyclic
(but analysis algorithms can detect if cycles get introduced during program
execution). In previous work [8], we have presented decidability results for a class
of concurrent list accessing programs. The two models are different: the model 
in [8] allows concurrency and nondeterminism, but is not able to capture for 
example the list reversal transduction. The restrictions in [8] are rather intricate, 
and that is what triggered this study in search of a robust automata-theoretic 
model. Extending the streaming transducer model to capture concurrency is an 
interesting research direction. There is an emerging literature on automata and 
logics over data strings [15,3] and algorithmic analysis of programs accessing 
data strings [2]. While existing literature studies acceptors and languages of 
data strings, we want to handle destructive methods that e.g. delete elements, 
and thus, a model of transducers is needed. 

A number of automata-based techniques have been proposed for shape anal- 
ysis [6,5] (see also [9] for a survey). In particular, the regular model checking 
approach [6] employs transducers to reason about heap-manipulating programs 
in the following manner. The set of heaps feasible at a program point is repre- 
sented by either a string automaton or a tree automaton, and the transformation 
on the heap due to a single statement is captured by a corresponding transducer 
model. The transformation of the entire program, then, corresponds to iterated 
composition of such transducers. Given a regular initial set of heaps, the set of 
heaps reachable after one transition will be regular. However, regular languages 
are not closed under unbounded union, so the set of all reachable heaps need 
not be regular. Consider a program that given an input list w outputs the list
ww (note that data values do not play an important role in this transduction).
For such a program, the iterative fixpoint procedure to compute the set of all 
reachable configurations does not terminate (in fact, the set of reachable config- 
urations is not regular, and cannot be represented by a finite-state automaton). 
However, a streaming transducer that computes such a transduction can be eas- 
ily defined. It is important to note that our decision procedures do not attempt 
to compute the set of reachable configurations (or heap contents). The literature 
on regular model checking provides several techniques for over-approximations 
of the set of reachable heaps to ensure termination, such as widening [20] and 
specialized abstractions using counters [4]. 
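One way such a streaming transducer for the transduction w to ww could look is sketched below, with two data-string variables that are both extended by the current input symbol at every step. The Python rendering and the name duplicate are illustrative assumptions; the point is that each data-string variable occurs at most once on the right-hand sides of a step, so the updates are copyless, while the input symbol itself may be used twice.

```python
def duplicate(w):
    """Single-pass computation of ww using two data-string variables."""
    x, y = [], []             # both variables start empty
    for symbol in w:          # one left-to-right pass over the input
        x = x + [symbol]      # x := x . symbol
        y = y + [symbol]      # y := y . symbol
    return x + y              # final output: x . y
```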

Analyzing programs that manipulate dynamic linked data structures is a 
widely studied problem commonly described as shape analysis [19]. Proving as- 
sertions of such programs is undecidable [12, 18], and the bulk of the literature 
consists of abstraction-based techniques for verification (see e.g. [17, 13, 10]). The 
core problem these techniques focus on is computation of invariants that often 
need to quantify over the nodes in the heap. Let us consider function Delete 
from Section 3. If the function was called with the parameter d, a natural post- 
condition is that all nodes reachable from result have values different from d. 
A quantified invariant needed to prove the postcondition could be automatically 
computed using for example the approach described in [17]. However, we empha- 
size that in contrast to the sound-but-not-complete abstraction-based methods 
for checking safety properties, our approach is sound and complete for a well- 
defined class of programs and, in addition to checking of assertions and pre/post 
conditions, we presented an algorithm for checking equivalence of programs. 



7 Conclusions 

We have introduced a streaming transducer model, and showed that it can serve 
as a foundational model of single-pass list processing programs. Our results lead 
to algorithms for checking functional equivalence of two programs, written pos- 
sibly in different programming styles, for commonly used routines for processing 
lists of data items. We are not aware of any prior decidability results for checking 
semantic equivalence of list processing programs, even for the restricted case of 
bounded data domains. 

We also believe that the streaming transducer model introduced in this paper 
is of independent theoretical interest. We have started the investigation of expres- 
siveness and related theoretical properties of the transducer model when the data 
domain is bounded. Classical string-to-string finite-state transducers need to be 
"two-way" to implement an operation such as reverse. In a subsequent paper, we 
showed that the streaming string transducer model is expressively equivalent to 
two-way transducers [1], and thus, to MSO-definable string transductions [11]. 
Learning streaming string transducers from input/output examples, and defining
a similar streaming transducer model for tree-structured data are potentially
fruitful directions for future research.
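As a reminder of why the streaming model does not need to be two-way for reverse, the operation can be sketched with a single data-string variable to which each input symbol is prepended. The Python function reverse_stream is a hypothetical illustration, not a construction taken from the paper.

```python
def reverse_stream(w):
    """Single-pass list reversal with one data-string variable."""
    x = []                    # data-string variable, initially empty
    for symbol in w:          # one left-to-right pass over the input
        x = [symbol] + x      # x := symbol . x (x is used once: copyless)
    return x
```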
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